ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation
Web archives are a valuable resource for researchers of various disciplines.
However, to use them as a scholarly source, researchers require a tool that
provides efficient access to Web archive data for extraction and derivation of
smaller datasets. Besides efficient access we identify five other objectives
based on practical researcher needs such as ease of use, extensibility and
reusability.
Towards these objectives we propose ArchiveSpark, a framework for efficient,
distributed Web archive processing that builds a research corpus by working on
existing and standardized data formats commonly held by Web archiving
institutions. Performance optimizations in ArchiveSpark, facilitated by the use
of a widely available metadata index, result in significant speed-ups of data
processing. Our benchmarks show that ArchiveSpark is faster than alternative
approaches without depending on any additional data stores while improving
usability by seamlessly integrating queries and derivations with external
tools.
Comment: JCDL 2016, Newark, NJ, US
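The abstract above attributes the speed-up to consulting a lightweight metadata index before touching the archived records themselves. A minimal sketch of that index-first idea follows. It is not ArchiveSpark's API; `IndexEntry`, `select_then_load`, and the `load_payload` callback are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical metadata record, one per archived capture."""
    url: str
    mime: str
    status: int
    offset: int  # assumed pointer to the full record in the archive file


def select_then_load(
    index: Iterable[IndexEntry],
    predicate: Callable[[IndexEntry], bool],
    load_payload: Callable[[int], str],
) -> Dict[str, str]:
    """Filter on cheap metadata first, then load only the surviving records."""
    selected: Dict[str, str] = {}
    for entry in index:
        # The predicate sees metadata only, so rejected entries cost no archive I/O.
        if predicate(entry):
            selected[entry.url] = load_payload(entry.offset)
    return selected
```

Under these assumptions, a filter such as "successful HTML captures only" is decided from the index alone, and the expensive loader runs only for matching entries.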
To Index or Not to Index: Optimizing Exact Maximum Inner Product Search
Exact Maximum Inner Product Search (MIPS) is an important task that is widely
pertinent to recommender systems and high-dimensional similarity search. The
brute-force approach to solving exact MIPS is computationally expensive, thus
spurring recent development of novel indexes and pruning techniques for this
task. In this paper, we show that a hardware-efficient brute-force approach,
blocked matrix multiply (BMM), can outperform the state-of-the-art MIPS solvers
by over an order of magnitude, for some -- but not all -- inputs.
We also present a novel MIPS solution, MAXIMUS, that takes
advantage of hardware efficiency and pruning of the search space. Like BMM,
MAXIMUS is faster than other solvers by up to an order of magnitude, but again
only for some inputs. Since no single solution offers the best runtime
performance for all inputs, we introduce a new data-dependent optimizer,
OPTIMUS, that selects online with minimal overhead the best MIPS solver for a
given input. Together, OPTIMUS and MAXIMUS outperform state-of-the-art MIPS
solvers by 3.2x on average, and up to 10.9x, on widely studied
MIPS datasets.
Comment: 12 pages, 8 figures, 2 table
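To make the brute-force baseline from the abstract above concrete, here is a minimal sketch of exact top-k MIPS computed one block of query vectors at a time. It is plain Python for readability, not the paper's BMM implementation, which would rely on an optimized matrix-multiply library. The function names, the `block_size` parameter, and the tie-breaking rule (lower item index wins) are assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def inner(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def brute_force_mips(
    users: Sequence[Vector],
    items: Sequence[Vector],
    k: int,
    block_size: int = 2,
) -> List[List[int]]:
    """For each user vector, return indices of the k items with the largest inner product."""
    results: List[List[int]] = []
    for start in range(0, len(users), block_size):
        block = users[start:start + block_size]
        # scores[r][c] = <block[r], items[c]>: one block of rows of the full score matrix.
        scores = [[inner(u, item) for item in items] for u in block]
        for row in scores:
            # Exact top-k by full sort; ties are broken by the lower item index.
            ranked = sorted(range(len(items)), key=lambda j: (-row[j], j))
            results.append(ranked[:k])
    return results
```

Every user-item pair is scored, so the result is exact. Processing users in blocks is what lets a real implementation hand each block to a single dense matrix multiplication.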
Estimating and Explaining Model Performance When Both Covariates and Labels Shift
Deployed machine learning (ML) models often encounter new user data that
differs from their training data. Therefore, estimating how well a given model
might perform on the new data is an important step toward reliable ML
applications. This is very challenging, however, as the data distribution can
change in flexible ways, and we may not have any labels on the new data, which
is often the case in monitoring settings. In this paper, we propose a new
distribution shift model, Sparse Joint Shift (SJS), which considers the joint
shift of both labels and a few features. This unifies and generalizes several
existing shift models including label shift and sparse covariate shift, where
only marginal feature or label distribution shifts are considered. We describe
mathematical conditions under which SJS is identifiable. We further propose
SEES, an algorithmic framework to characterize the distribution shift under SJS
and to estimate a model's performance on new data without any labels. We
conduct extensive experiments on several real-world datasets with various ML
models. Across different datasets and distribution shifts, SEES achieves
significant (up to an order of magnitude) shift estimation error improvements
over existing approaches.
Comment: Accepted to NeurIPS 202
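The abstract above names label shift as one of the existing models that SJS generalizes. As an illustration of estimating performance without target labels in that simpler special case only, the sketch below uses a confusion-matrix approach for a binary classifier. It is not SEES and does not handle feature shift. The function name and the 0/1 label encoding are assumptions, and it relies on the label-shift assumption that the distribution of inputs given the label is unchanged between source and target.

```python
from typing import List, Sequence, Tuple


def estimate_accuracy_under_label_shift(
    source_labels: Sequence[int],
    source_preds: Sequence[int],
    target_preds: Sequence[int],
) -> Tuple[List[float], float]:
    """Return (label weights w, estimated target accuracy) for a binary classifier."""
    n_s, n_t = len(source_labels), len(target_preds)

    # joint[i][j] = P_source(prediction = i, label = j), estimated from labeled source data.
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for y, p in zip(source_labels, source_preds):
        joint[p][y] += 1.0 / n_s

    # mu[i] = P_target(prediction = i), estimated from unlabeled target data.
    mu = [sum(1 for p in target_preds if p == i) / n_t for i in (0, 1)]

    # Under label shift, joint @ w = mu with w[j] = P_target(label = j) / P_source(label = j).
    det = joint[0][0] * joint[1][1] - joint[0][1] * joint[1][0]
    if abs(det) < 1e-12:
        raise ValueError("confusion matrix is singular; weights are not identifiable")
    w0 = (mu[0] * joint[1][1] - joint[0][1] * mu[1]) / det
    w1 = (joint[0][0] * mu[1] - joint[1][0] * mu[0]) / det

    # Target accuracy = sum_j w[j] * P_source(prediction = j, label = j).
    accuracy = w0 * joint[0][0] + w1 * joint[1][1]
    return [w0, w1], accuracy
```

The weights are recovered by solving a 2x2 linear system with Cramer's rule. With more classes the same system would be solved with a general linear solver, and the estimate is only as good as the label-shift assumption itself.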